
    Version Control of Speaker Recognition Systems

    This paper discusses one of the most challenging practical engineering problems in speaker recognition systems: the version control of models and user profiles. A typical speaker recognition system consists of two stages: the enrollment stage, where a profile is generated from user-provided enrollment audio; and the runtime stage, where the voice identity of the runtime audio is compared against the stored profiles. As technology advances, the speaker recognition system needs to be updated for better performance. However, if the stored user profiles are not updated accordingly, version mismatch will result in meaningless recognition results. In this paper, we describe different version control strategies for different types of speaker recognition systems, according to how they are deployed in the production environment.
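    The core hazard the abstract describes is scoring a stored profile against a model that did not produce it. A minimal sketch of one version-control strategy, tagging each profile with the model version that generated it and refusing to score on mismatch (the `Profile` type, field names, and re-enrollment policy here are illustrative assumptions, not the paper's API):

    ```python
    from dataclasses import dataclass
    from typing import Callable, List

    @dataclass
    class Profile:
        user_id: str
        model_version: int      # version of the model that produced this profile
        embedding: List[float]  # speaker embedding computed at enrollment

    def is_compatible(profile: Profile, runtime_model_version: int) -> bool:
        """A profile is only meaningful when scored by the model version
        that produced it (or one explicitly declared compatible)."""
        return profile.model_version == runtime_model_version

    def verify(profile: Profile, runtime_model_version: int,
               score_fn: Callable[[List[float]], float]) -> float:
        if not is_compatible(profile, runtime_model_version):
            # Version mismatch: the comparison would be meaningless, so
            # request re-enrollment (or migrate the profile offline) instead.
            raise ValueError("profile/model version mismatch; re-enroll user")
        return score_fn(profile.embedding)
    ```

    Other strategies the paper's framing admits (depending on deployment) include offline batch migration of all stored profiles when the model is upgraded, or keeping the old scoring model alive until each user has re-enrolled.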

    Health Diagnostics Using User Utterances

    Respiratory illnesses can be hard to track and diagnose. Obtaining useful clinical data on these illnesses is difficult because it requires physical interaction, e.g., via nasal or sinus swab. It is known that respiratory illness can affect the speech pathways. To this end, this disclosure describes techniques that use readily accessible software to obtain and classify potentially useful data. With user permission, utterances of the user, e.g., activation of a speech-activated device via a hotword, are analyzed to form speaker-ID models. These models are evaluated against additional utterances of the user in a sequential manner. The evaluation scores, along with the timestamps and details of the models, are aggregated to determine whether the user has an interval of time where their speaker-ID models are unstable, inconsistent, or lacking self-similarity. This signal can be used as a proxy for detection or as a motivating factor for clinical investigation.
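    The aggregation step above can be sketched as a sliding window over the sequential evaluation scores, flagging intervals where the scores lack self-similarity. This is a minimal illustration under assumed parameters (the window length and spread threshold are hypothetical tuning knobs, not values from the disclosure):

    ```python
    import statistics
    from typing import List, Tuple

    def unstable_intervals(scores: List[float], timestamps: List[float],
                           window: int = 5, threshold: float = 0.15
                           ) -> List[Tuple[float, float]]:
        """Flag time intervals where sequential speaker-ID evaluation scores
        show high variability (low self-similarity).

        scores: per-utterance similarity of new audio against the current
        speaker-ID model, in chronological order; timestamps align 1:1.
        Returns (start_ts, end_ts) pairs for every window whose population
        standard deviation exceeds `threshold`.
        """
        intervals = []
        for i in range(len(scores) - window + 1):
            chunk = scores[i:i + window]
            if statistics.pstdev(chunk) > threshold:
                intervals.append((timestamps[i], timestamps[i + window - 1]))
        return intervals
    ```

    A stretch of consistently high, tightly clustered scores produces no flags; a run of erratic scores yields candidate intervals that could motivate clinical follow-up.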

    Attention-Based Models for Text-Dependent Speaker Verification

    Attention-based models have recently shown great performance on a range of tasks, such as speech recognition, machine translation, and image captioning, due to their ability to summarize relevant information that spans the entire length of an input sequence. In this paper, we analyze the use of attention mechanisms for the problem of sequence summarization in our end-to-end text-dependent speaker recognition system. We explore different topologies of the attention layer and their variants, and compare different pooling methods on the attention weights. Ultimately, we show that attention-based models can improve the Equal Error Rate (EER) of our speaker verification system by a relative 14% compared to our non-attention LSTM baseline model.
    Comment: Submitted to ICASSP 201
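    Sequence summarization by attention pooling, as described above, replaces simple last-frame or mean pooling with a learned weighted average over all frame-level embeddings. A minimal sketch of scalar attention pooling (the scoring vector `w` stands in for learned parameters; the paper's actual topologies and pooling variants differ):

    ```python
    import math
    from typing import List

    def attention_pool(frames: List[List[float]], w: List[float]) -> List[float]:
        """Collapse a variable-length sequence of d-dim frame embeddings
        into one utterance-level d-dim vector.

        score_t = <w, h_t>; alpha = softmax(scores); out = sum_t alpha_t * h_t
        """
        scores = [sum(wi * hi for wi, hi in zip(w, h)) for h in frames]
        m = max(scores)                           # shift for numerical stability
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        alphas = [e / z for e in exps]            # attention weights, sum to 1
        d = len(frames[0])
        return [sum(a * h[j] for a, h in zip(alphas, frames)) for j in range(d)]
    ```

    With a zero scoring vector, every frame gets equal weight and the layer degenerates to mean pooling; a trained `w` lets the model emphasize the frames most informative for the speaker's identity.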

    Secure audio processing

    Automatic speech recognizers (ASR) are now nearly ubiquitous, finding application in smart assistants, smartphones, smart speakers, and other devices. An attack on an ASR that triggers such a device into carrying out false instructions can lead to severe consequences. Typically, speech recognition is performed using machine learning models, e.g., neural networks, whose intermediate outputs are not always fully concealed. Exposing such intermediate outputs makes the crafting of malicious input audio easier. This disclosure describes techniques that thwart attacks on speech recognition systems by moving model inference processing to a secure computing enclave. The memory and signals of the secure enclave are inaccessible to the user and untrusted processes, and are therefore resistant to attacks.